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ABSTRACT 



In this paper, the approach is to study an estimator of distribution free 
and to design source program which might be useful. This distribution free 
estimator, super-parametric density estimator, and its related algorithm were 
suggested (Tsai et al. 2007). Though we will focus on the implementation, 
the computer programming, of the algorithm and strategies of choosing window 
functions, the consistency of the of the estimator is studied and the window 
functions such as B-spline, Bezier spline and piecewise Bezier spline are studied 
as well. Since the algorithm is designed for solving the optimization of likelihood 
function, there is a set of nonlinear equations with a large number of variables. 
The numerical results show that algorithm is very powerful and effective in the 
sense of mathematics, that is, the iteration procedures converge and the rate of 
convergence is very fast. Though it is not main purpose to study the consis- 
tency of the estimator, the approach in this paper to attain the consistency is 
straightforward and comprehensive. From the numerical examples, the reader 
can find how to use this new theory and new methods of density estimation. 
The fortran source programs are appended in this paper. 

1. INTRODUCTION 

Though it might not be a nice approach to learn statistics from the appli- 
cation, the authors started to learn statistics from computer science. There 
are good surveys of density estimation in the textbook (Duda and Hart 1973). 
The authors studied nonparametric density estimation, Parzen-window, from 
this textbook and penalized likelihood method from papers (Good and Gask- 
ins 1971,1980). We think that parametric approach is well established method. 
Therefore, we try to combine the theories and techniques of both nonparametric 
and parametric approaches. There are two problems must be solved. The first 
one is the consistency of the estimator and the second one is the nonlinear opti- 
mization. Though it is a hard work to encode and to debug a source program, 
it is worthwhile to try a new theory and a new method. Originally, we designed 
a fortran source program to test the algorithm of nonlinear mathematical pro- 
gramming and to test the function of splines. Due to the powerful theorem, 
Bernstein polynomial and Stone- Weierstrass theorem, the results are so good 
that are beyond our imagination, especially, the continuous case. 

2. PARZEN WINDOW AND SUPER-PARAMETRIC ESTIMA- 
TOR 

In order to solve the second problem, we model the problem by intuition. 
Let 6 be Dirac delta function. Let / be the density function. Clearly, it is that 



Here, / will be estimated by observationsxi,X2,£3v . . ,x m . The integration 
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is replaced by summation. Let / be the estimator of /. Let f(x) = Ciifi(x) 

i=i 

where tpi are window functions, < <fi(x), f ipi(x)dx = Pi and pi < oo. Let 

m 

/ = Yi f( x j) be the likelihood function. Now the problem is to maximize I 

n 

subjected to the constraints^ PiC-i = 1 and < c$, i — 1,2, ...,n. Mathemati- 

t=i 

n 

cally, since a are going to be determined, if we redefine f(x)= (. c i/Pi)<Pi( x ) 

i=l 

n 

, then the constraints become Y] Cj = 1 and < Cj, i = 1,2, ...,rt. Naturally, 

i=l 

this estimator is called super-parametric estimator. We have noticed that most 
nonparametric documents give the note: If the density function is the (linear) 
combinations of window functions, then Dirac delta functions shall be obtained 
when maximum likelihood estimator is applied and hence undesirable rough- 
ness will be introduced. We think, by choosing window function carefully, the 
roughness can be avoided. Before we discuss how to choose the window function, 
we will quote some results of nonlinear optimizations designed (Tsai et al. 2007) . 

3. THE ITERATION PROCEDURES 

n _ 

Let f(x) — ^2 u i v i<Pi{ x )i where ipi are the window functions. Let / = 

t=i 

m _ 

f| f(xj) be the likelihood function. Now the problem is to maximize I subjected 
to the constraints 

n 

^ Uim = r. (1) 

i=l 
n 

^ ViVi = r. (2) 

i=l 

Let I and I be the likelihood functions defined above. Let A be the set of all 
I. Let A be the set of all I. It is obvious that A C A. Therefore, the maximum 
of A is less than or equal to that of A. It was shown that the extreme points 
of I should be located at the points such that Ui = Vi, i = 1, 2, n. Therefore, 
the problem to maximize I subjected to its constraint is equivalent to that of 
maximizing I subjected to the constraints (p} and |(2J). Instead of solving the 
problem directly, the iteration procedures are constructed. 

The procedures are: 

Step (i). Initialize the procedure by setting k — 1 and v\ — y^r/n, i = 
l,2,...,n. 

Step (ii). maximize I subjected to the constraint {Ip. Then values of uf,i = 
1,2,..., n, are obtained. 
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Step (iii). Check the condition ^ufvf + e>r is satisfied or not, where 

i— 

e is a small positive number for controlling the termination of the procedures. 
If the condition is satisfied, then stop the iteration procedures and the density 

n 

estimator, f(x) = ^2 u i v i i Pii. x> )) * s obtained. Otherwise, increase the value of 

i=l 

k by one, set vf = 9\J u^ 1 v l [ ~ 1 , here 9 is a chosen constant for satisfying 
constraint H|). Then go to Step (ii) and proceed the procedures. 

It is not so easy to complete step (ii), because nonlinear optimization is 
very complicated usually. Let ipi(x) = vfipi(x). The simple notation, f(x) = 

n 

^2 Uiipi(x) shall be used and the superscript of the symbols u\ and v\ shall be 

dropped hereafter. Let 6^ = ipi(xj). Let Dij — (r/m)(bj • hj). Let u and bjbe 

n components vectors, where u = \u\, tia, .., and hj = [bij, b%j, b n j] . 
tci m 

Then u = (r/m) ^2 oybj, where ^2 DkjCtkCtj = 1, k = 1,2, ...,m. And hence 

3=1 3=1 

step (ii) is executed completely if the values of all at are found. The proce- 
dures has been designed and studied by (Tsai et al. 2007), especially, the most 
complicated step, step (ii), is studied completely. Some of them are listed in 
the appendix A. Since it has been studied, we will not discuss the details of 
the procedures in this paper. In order to avoid introducing the roughness, the 
splines are chosen as the window functions. 



4. THE CONSISTENCY OF THE ESTIMATOR 



In order to make the approach more comprehensive, we prefer to use the 
same mathematical notation as elementary calculus. Though the upper case 
letter Xi and the lower case letter Xi are associated with different meanings, 
we try to use lower case letter as much as possible. If we are going to study 
the estimation density function which is distribution free, we may assume that 
the density function, /, be a measurable function defined on its domain. Let 
gni(x) = 1 for some interval (a n i,b n i) and g n i(x) — otherwise. Then there is 
a sequence f n such that lim f n (x) — fix) almost everywhere, where f n (x) = 

n — >oc 

m 

^2 9^g n i(x). If we impose some conditions on /, then it might be possible 

i=l 

that lim f n (x) — f(x) uniformly. For any e > 0, there is f n such that 

n — >oo 

f(x) - f(x) < \f(x) - f n {x)\ + f n (x) - f(x) . Here / is an estimator of /. 
Though / and /„ are unknown, we can estimate / if all g n i are known. Let 

^ m 

f( x ) = 12 0i9ni{x). If 9i, i = 1,2, m, are parameters which are going to be 

determined, then this is a simple parametric problem. Roughly speaking, the 
nonparametric estimator can be transferred to a parametric estimator and hence 
we call it super-parametric estimator. There are many contributors (Wald 1949) 
who have proved the consistency of the maximum likelihood estimators. The 
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only problem that we need to solve is a nonlinear optimization problem. If the 
density function,/, is continuous on closed interval [a, b], then Bernstein poly- 
nomials play an important role in density estimation. Here, we assume that it is 
well known that there are strong connections among the Bezier spine, Bernstein 
polynomial and Stone- Weierstrass theorem. Let / be the density function which 

^ n 

is defined and continuous on [0,1]. Let f(x) — ^i^iix) be super-parametric 

i=l 

n 

estimator. Let B n {x) = f{xi)C™x l (l — x) n ~ l be Bernstein polynomial which 

i=0 

is associated with the density function /, where C" = n\/(i\(n — and 
Xi = i/n. Let ipi(x) — NiC^x 1 ^ — x) n ~ l , here TV, is a constant to make 
/ (fi(x)dx = 1. Let 9® — f(xi)/Ni. From the property of Bernstein polynomial, 
we have 



m - m 



< 



(3) 



i=0 i=0 

Since the first term in right hand side of the inequality can be handled, it 
seems that the problem of density estimation becomes a problem of curve fitting. 
Of course, it is not so simple actually. 

In order to make our approach more comprehensive, we use the advantages 
of mathematical notations. Let 

n n 

S n = {g; g(x) = ^ m {x), ^ fl< = 1, 0< > 0, x e £>}. 

.Here ifi is a normalized nonncgative function, for example, the window 
function obtained from Bernstein polynomials and D is the domain of func- 
tions. Let / be the density function which is going to be estimated by a set of 
samples. If / € S n , we are so lucky, then the consistency of super-parametric 
estimator had been proved (Wald 1949). Generally speaking, it is impossible to 
infer the uncountable information of a density function by a finite set of sam- 
ples without sufficient assumptions and strong intuition. It should be allowed 
to approximate the density function by all means. Let X±, X2,. . . , X m be in- 
dependent identically distributed from a distribution F of which the density is 
/. Let G be another distribution of which the density is g and / 7^ g. Let 
^su P = {x; f(x)g(x) 7^ 0}. For simplicity, we assume that the integral is taken 
on A sup . It is obvious that 

J \og{^)dF < log( J ^dF) = 0. (4) 

since the second derivative of — log is positive and hence — log is convex. From 
the law of large number, if the number m is large enough, then we have 

1 m -. m 

-^log ff (X i )<-^log/(X i ). (5) 

i=i 1=1 
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From (0, intuitively, it is a reasonable approach to approximate / by go if 

m m 

Iq = Yl 9o{ x j) an d is the maximum of L, where L = {I; I = Y[ g{xi), 9 G S n }. 

3=1 »=i 
The existences of go and Zo are doubtless since the set of Xi are known and hence 

I is a continuous function defined on compact set f2, 

n 

fi = #2> •••> > 0, &i — !}■ We will not discuss the details here. 

i=l 

The further study is discussed in appendix B. 
5. B-SPLINE AND BEZIER SPLINE 

Though normal distribution is a good candidate for window function (Duda 
and Hart 1973), we try to use the splines as window function in this paper. 
Without the Taylor's series, most useful function might not be useful. Without 
the power series, there would be no special functions which are used in classic 
physics and quantum mechanics. In the practical problem, the power series 
shall be replaced by polynomial. Spline is a synonym of polynomial. There 
were many splines which were developed in last century (Newman and Sproull 
1979, Quarteroni Sacco and Saleri 2000). B-spline is one of the most useful 
splines. In computer graphic, we will call them blending functions instead of 
window functions. According to the degree of blending polynomial functions, 
the blending function is denoted by the symbol Ni t k and defined as follows: 

N it x(x) = 1 if t t < x < t i+ i, 

Ni i(x) — otherwise. We define them recursively, 

, T , , (x - ti)Ni,k-i(x) , (t i+k -x)N i+lk _i(x) 

J^i,k\x) = 1 ! . 

ti+k-l — ti U+k — ti+l 

Here, we have the convention, 0/q = 0, and the set of knot values U, 

i = 0, 1, 2, n + fc, are defined 

ti = xq if i < k. 

ti = Xi_k+i h" k < i < n. 

k = if i > n. 

The estimator shall be defined the linear combinations of the blending func- 

tions, that is f(x) = ^2 CiN^k(x), where k will be chosen for controlling the 

order of continuity. Indeed, we can find that they are the simple window func- 
tions used in Parzen approach when k = 1, though they are not centrally lo- 
cated. In order to use the B-spine, the observations should be reordered such 
that xo < xi, ...,< x m - In B-splinc method, we put m = n. 

Bczier spines are much simpler than B-spines because there are no extra 
knot points in Bezier spline. We have listed them already and they are 

Bi, n {x) = " - x l (l-x) n -\i = 0,l,...,n, 0<x<l. 
i\\n — 1)\ 

Since x could be defined on the specified interval, the scaling must be done for 
individual problem at hand. It should be emphasized that the algorithm (sup- 
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porting document 2007) requires that each observation data Xj there must be 
some ipi such that fi(xj) > 0. If the Gauss distributions, long tail distributions, 
are chosen to be the window functions, then the requirement is satisfied auto- 
matically. Since we are in favor of splines, the requirement makes the computer 
programming more complicated, especially, in B-spline. In many applications, 
B-spline has more advantages than Bezier spline. Therefore, we started encod- 
ing the program for choosing as the window functions the B-spline. In most 
documents, B-splines are defined on a parameter or parameters, more precisely, 
B-spline curves and surfaces are define by one and two parameters respectively. 
All the information of the B-splinc that we get is the B-spline of low order with 
uniform spaced knot points. In this paper, the knot points are the observa- 
tions. Of course, the set of observation shall be sorted and assigned to be the 
knot points. Unlike the Parzen-window functions, the sizes and the shapes of 
the window functions are different from one to another. And this makes some 
difficulties, for example, the area of some window functions might be zero or 
very small numbers and hence it is impossible to normalize the window func- 
tions in numerical computation. These difficulties can be handled by symbolic 
computation of integral by using some computer software such as Mathematica, 
Maple etc. From the figures, Figure 1 and Figure 2, we can find some effects 
of non-equalized window functions, especially at two end points. Therefore, the 
interval on which the windows are defined is extended in both end points. 

6. THE PIECEWISE BEZIER SPLINE 

Bezier spline with order 10, more or less, is good enough to estimate any 
continuous unimodal density function since it has 11 parameters to control the 
curve. If it is necessary, then the order can be increased to 30. Classification 
plays an important role in many application fields, classification by the features 
of different species. Samples of pattern classification are not obtained from a 
single distribution. If the density is well-defined, then it might the mixture of 
different densities, so to speak. Naturally, the piecewise Bezier spline is a nice 
candidate for estimator. Moreover, the consistency we have discussed is the 
case of continuous density function. If the case we study is not continuous, then 
we should modify it to fit the real case. In order to implement the piecewise 
Bezier spline, the domain of distribution shall be partitioned into subdomains 
if it is necessary. After the domain having been partitioned, the estimator is 
the sum of Bezier splines which are defined on the subdomains. Since the data, 
the samples, is the set of finite elements, the problem is to design a simple com- 
puter program to partition domain into subdomains. The algorithm is based on 
searching for tails in the middle. In order to collaborate with fortran source pro- 
gram, we use the array notation instead of subscript index. The procedures are: 
(PI) Sort the samples to obtain the ordered samples, say, x(i), i = 0, 1, 2, m. 
(P2) compute the (random) intervals t(i), t(i) — x(i + 1) — x(i), (P3) Find the 
maximum of t(i), say t(ia). (P4) Test the condition for partitioning the domain 
into two subdomains. Here, we assume that the sample size m = 180. The 
condition is m /g < io < ^ m /g and 30 < io < (m — 30). If the condition is satis- 
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fied, then the interval [x(io), x(io + 1)] is removed and the domain is partitioned 
into to subdomains, [x(0), x(io)] and [x(io + l),x(m)] and hence the set of t(i), 
not the samples, is divided into two subsets, set t(io) = indicator. In order 
to make the program workable, the location of must be stored in an array, say 
lo(ik), in the fortran program. And the value of t(io) is set to be indicator, say, 
t(io) = —1 Otherwise, that is, z > (5m)/6 or i < m/6, the value of t(io) is set 
to be zero. The setting must be done to avoid being reselected. (P5) Test the 
condition for terminate the algorithm. The condition is all the length between 
two adjacent indicators is less than, say, m/6. If the condition is satisfied, then 
terminate the algorithm. Otherwise, go to procedure (P3). After the algorithm 
being executed, the domain may be one piece interval or partitioned into several 
disconnected subintervals. 

7. NUMERICAL EXAMPLES 

In this paper, there are three numerical examples : Example 1 is an uni- 
modal distribution and the probability density function is exp(-x). Example 2 
is bimodal and the probability density function is defined on [0,4] , f(x) = 2/3 
when 1 < x < 2 ; f(x) = 1/3 when 3 < x < 4 ; otherwise f(x) = 0. Example 3 
is a trimodal distribution and the probability density function is defined on [0, 4] 
, f{x) = 1 when < x < 1/2 ; f(x) = 1/2 when 1 < x < 3/2 ; f(x) = 1/2 when 
3 < x < 7/2 ; otherwise f(x) = 0. Before starting to design the fortran source 
program for testing the formulation of this paper, we should study the character 
of B-spline and Bezier spline. Figure 1 to Figure 6 are the numerical results 
of this paper. There are three methods, B-spline with order of continuity 12, 
Bezier spline and piecewise Bezier spline. It is meaningless to use large number 
of windows and hence the number of windows is reduced to 11 when the sample 
size is 30. 

Though the consistency that we have studied is the distribution with con- 
tinuous density function, the numerical examples we apply are not confined in 
the continuous density function. Therefore, some figures are not good enough. 
But they are acceptable. 

8. DISCUSSION AND CONCLUSION 

Comparing to the existent results (Dong and Wetes 200, Good and Gaskins 
1971, Parzen 1962 and Rosenblatt 1956), the super-parametric approach is a 
method with potential. In the continuous case, it seems that the concavity of 
estimator, Bezier spline with order 10, is almost the same as that of the density 
function. So far, we think it might be a coincidence since the concavity of a 
function concerns with the second derivative of the function and the likelihood 
function has nothing concerning the derivative of any function. Roughly speak- 
ing, the order of Bezier spline should be less than 10 otherwise it will violate 
the spirits the piecewise polynomials. Due the high degree of global polyno- 
mial, the adjoin properties, oscillations will be introduced. This is a drawback 
of global polynomial. If global polynomial works well, then it is not necessary 
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to design new splines such as cubic spline, B-spline etc., since we have had La- 
grange polynomial. In order to apply Stone- Weierstrass theorem, the piecewise 
Bezier spline shall be adopted. Though Bezier spline can be jointed by pieces 
in computer aid design, the joint of Bezier spline in density estimator should be 
carefully treated because the window function in both ends have only one side 
tails. This will introduce the biases of samples implicitly. Therefore, we hope 
that some new flexible spline should be designed and studied. 

In this paper, we focus on the splines instead of general super-parametric 
approach. Though there might be some better window functions, it seems that 
Bezier spline is a nice candidate for window functions of the super-parametric es- 
timators. The programs were designed for testing. Therefore, they are designed 
by bottom up and hence they are not readable. After having been tested, some 
documents were inserted in the programs and they become readable. There- 
fore, we decide to attach the source programs in this paper. It is very easy or 
trivial to generalize this approach to multivariate distributions. In multivariate 
distributions, the coordinates of samples shall be transformed to principle di- 
rection axis. This can be done by diagonalizing the covariance matrix. After 
the transformation, the new coordinates of samples are projected into the axis 
accordingly and hence spline of higher dimension can be constructed by taking 
the product of one dimension spline on each axis (Newman and Sproull 1979). 

If all the shape of window functions are the same and each one window 
covers only one sample, then we get the same result as Parzen's approach and 
hence the consistency has been proved. From the figures, we find that B-spline 
estimator is not so good as the others. It must be confessed that we do not 
use B-spline appropriately since we just assign the samples to knot points. If 
we should choose the knot points carefully the results might be better. Due 
to some reason, the roughness of estimator, we do not try to improve choosing 
knot points. Though the results obtained from B-spline is not good enough it 
might be useful if we follow the approach of Bayes. Like Bayes learning, we may 
consider uniform distribution as priori density, the different procedures which 
we use are processes of learning and the results obtained from the piecewise 
Bezier as posteriori density. 



APPENDIX A 



In order to solve the nonlinear equations ^ DhjCtkCtj = 1, k = 1, 2, to, the 

iteration procedures are constructed. First, initialize the procedure by setting 
a k = \Jl/{Dm) where D is the maximum of Dij. Then start the iteration 
procedures: 



Step (a). Compute Ei 



,z = 1,2,..., to, and E = J2 Ei- 

»=1 



J2 DijaiOij - 1 

Step (b). Test the condition whether E < 6 is satisfied or not, where S 
is a small positive number for terminating the procedures. If E < 5, then the 



9 



desired results are obtained. Compute Uk by the identity — (r/m) ^2 otjb k j, 

j i 

k = 1,2, ...,n. And stop the iteration. Otherwise, 

Step (c), Find the largest element of the set of all E{. Suppose that the 
largest element is E k for some k. Eliminate E k by updating the value of a k by 
«fc> a k = (~ s + V s2 + 4D kk )/2D kk . And go to Step (a). 



APPENDIX B 

n n 

Let S n = {g : g{x) = E^.W.E^M.^Mefl}. Let tpi be 

i=0 i=0 

bounded, that is, |</2j(a;)| < M. Though it is impossible to get the informa- 
tion of / completely, we may assume that there is a function g, g S S n such 
that — g{x)\ < s. From © we have, 



f(x)-f(x)dF< / \f(x)-g(x)\dF+ / g(x)-f(x) 



dF 



f(x) - f(x) dF < / \f(x) - g(x)\dF + / g(x) - f(x) (dF + dG - dG) 



fix) - f(x) dF<e dF+ g(x) - fix) dG+ / g(x) - fix) \f(x) - g(x)\ dx 



fix) - fix) dF<e dF+ gix) - fix) dG + e g(x) - f(x) dx 



f(x) - fix) dF<e dF+ gix) - fix) dG + e (\g(x)\ + fix) )dx 



fix) - fix) dF<e dF+ / gix) - fix) dG + e( dF + / dG) 



f \f(x) - f(x)\dF <3e+ [J2\0° - 0i\ \Vi{x)\dG 
I fix) - fix) dF<3e + AlJ2 f \ 6 i - e M G - 



(6) 



From ©, we are able to use the samples to estimate 9® though these samples 
are obtained from the distribution F instead of G. Here, we assume that e is 
very small. There is still a problem since the existence of g, \fix) — gix)\ < e, 
is not unique. We think that it is not a real problem because the problem 



10 



should be transferred to a practical problem, the global maximum of likelihood 
function. The problem has been studied partially (Tsai et al. 2007). 
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Figure 2: In this figure, the Bezier spline estimator is adopted to estimate the 
unimodal distribution. The sample size is 180. 
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Figure 3: In this figure, the B-spline estimator is adopted to estimate the 
bimodal distribution. The sample size is 180. 
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Figure 4: In this figure, the Bezier-spline estimator is adopted to estimate the 
bimodal distribution. The sample size is 180. 
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Figure 5: In this figure, the B-spline estimator is adopted to estimate the 
timodal distribution. The sample size is 180. 
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Figure 6: In this figure, the piecewise Bezier spline estimator is adopted to 
estimate the trimodal distribution. The sample size is 180. 
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